
Univariate Plots Section

Univariate Analysis

Dataset structure

The red wine dataset has 1599 observations and 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). The first thing I noticed is that all the variables are numerical; there is no factor variable. The second is that the quality variable is an integer ranging from 3 to 8, so it takes at most six distinct values and can be converted to an ordered factor variable.

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
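The conversion of quality to a factor mentioned above might look like the following. This is a minimal sketch on a toy vector (the first ten quality scores from the str() output); the name quality.factor is hypothetical and not part of the original analysis.

```r
# Toy data: the first ten quality scores shown in the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Declare all six possible scores (3 to 8) as levels, including those
# absent from this small sample, and keep their natural order.
quality.factor <- factor(quality, levels = 3:8, ordered = TRUE)

nlevels(quality.factor)   # number of quality levels
table(quality.factor)     # counts per level, including empty levels
```

Listing the levels explicitly keeps empty categories visible in tables and plots, which would not happen if the levels were inferred from the data alone.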

Main features of interest in the red wine dataset

Some variables have a large gap between their minimum and maximum values, so I will probably trim the outliers when visualizing them. Most of the variables are approximately normally distributed, but several are right skewed (for example, residual sugar and chlorides have maximum values far above their third quartiles). I will transform those when visualizing them so that they look more bell shaped. Distributions that are close to normal are helpful for my analysis. This graph shows the acidity distribution of red wine. It is approximately normal.

# frequency polygon of volatile acidity.
ggplot(aes(volatile.acidity),data=reds)+
  geom_freqpoly(binwidth = 0.03)+
  scale_x_continuous(breaks = seq(0,1.7,0.1))+
  labs(title = "Frequency Polygon of Volatile Acidity", x = "volatile acidity")

This frequency polygon indicates that volatile acidity is also approximately normally distributed.

# normal distribution of quality.
ggplot(aes(quality), data = reds)+
  geom_histogram(binwidth=1)+
  scale_x_continuous(breaks = seq(3,8,1))+
  labs(title = "Normal Distribution of Quality", x = "quality")

Quality is also roughly bell shaped, with most wines rated 5 or 6.

Residual sugar is right skewed, so I transformed the x axis to bring it closer to a normal distribution.

Chlorides is also right skewed; a log10 transformation of the x axis brings it closer to normal.
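To illustrate why a log10 transformation helps with a long right tail, here is a small sketch in base R. The sample is a toy one (the first ten residual.sugar values from the str() output plus the maximum from summary()), and the skewness() helper is hypothetical, written only for this illustration.

```r
# Toy sample: first ten residual.sugar values from str(), plus the
# maximum (15.5) reported by summary(). Not the full dataset.
x <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1, 15.5)

# Hypothetical helper: moment-based sample skewness
# (third central moment divided by the cubed standard deviation).
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew.raw <- skewness(x)          # positive: long right tail
skew.log <- skewness(log10(x))   # smaller: the log compresses large values

c(raw = skew.raw, log10 = skew.log)
```

In a ggplot2 plot, adding scale_x_log10() applies the same transformation to the x axis instead of to the data.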

New variables are created.

I created two new variables:

1. The ratio of fixed acidity to volatile acidity: fixed.acidity/volatile.acidity
2. The proportion of total sulfur dioxide that is free: free.sulfur.dioxide/total.sulfur.dioxide
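The two derived variables could be computed as sketched below. The column names acidity.ratio and sulfur.dioxide.per are the ones that appear later in the model formulas; that they correspond to these two definitions is an assumption. The data frame reds.sample is a toy copy of the first five rows from the str() output, not the full dataset.

```r
# Toy data: first five rows of the relevant columns, copied from the
# str() output in the Dataset structure section.
reds.sample <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity     = c(0.70, 0.88, 0.76, 0.28, 0.70),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34)
)

# Ratio of fixed acidity to volatile acidity.
reds.sample$acidity.ratio <- with(reds.sample,
                                  fixed.acidity / volatile.acidity)

# Share of total sulfur dioxide that is free (between 0 and 1).
reds.sample$sulfur.dioxide.per <- with(reds.sample,
                                       free.sulfur.dioxide / total.sulfur.dioxide)

reds.sample
```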

Bivariate Plots Section


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From the plot above it can be seen that some variables may have a strong relationship with others. For example, fixed.acidity has a positive relationship with citric.acid, and volatile.acidity has a negative relationship with it.

# fixed.acidity and citric.acid
acidity1 <- ggplot(reds,aes(citric.acid,fixed.acidity))+
  geom_point(alpha = 0.3)+
  geom_smooth(method = "lm")+
  ggtitle("Fixed Acidity and Citric Acid")

# volatile.acidity and citric.acid
acidity2 <- ggplot(reds,aes(citric.acid,volatile.acidity))+
  geom_point(alpha = 0.3)+
  geom_smooth(method="lm")+
  ggtitle("Volatile Acidity and Citric Acid")

library(gridExtra)  # provides grid.arrange()
grid.arrange(acidity1,acidity2)

From the plot above we can see that the fixed.acidity has a positive relationship with citric.acid, and volatile.acidity has a negative relationship with citric.acid.

The box plot shows some relationship between quality and fixed.acidity, but it is not very strong.

The relationship between quality and volatile acidity is clearly stronger.

Alcohol also has a visible relationship with quality. High-quality red wine often has a higher alcohol content.

Strongest relationship I found

The strongest relationship was between fixed.acidity and pH: a Pearson correlation test indicated a correlation of -0.68. For red wine quality, the strongest negative correlation is with volatile acidity, at -0.39, which suggests that higher-quality wine tends to have lower volatile acidity.

qcor <- function(x){
  with(reds,cor.test(quality,x))
}

qcor(reds$volatile.acidity)
## 
##  Pearson's product-moment correlation
## 
## data:  quality and x
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
#qcor(reds$fixed.acidity)
#qcor(reds$citric.acid)
#qcor(reds$sulfur.dioxide.per)
#qcor(reds$free.sulfur.dioxide)
#qcor(reds$sulfur.dioxide.per)
#qcor(reds$sulphates)
#qcor(reds$acidity.ratio)
#qcor(reds$density)
#qcor(reds$alcohol)
#qcor(reds$pH)

#with(reds,cor.test(fixed.acidity,density))
with(reds,cor.test(fixed.acidity,pH))
## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782
#with(reds,cor.test(fixed.acidity,citric.acid))
#with(reds,cor.test(volatile.acidity,citric.acid))
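Rather than calling cor.test() one variable at a time, the correlations with quality could be computed and ranked in one pass. The sketch below assumes a toy data frame, reds.sample, built from the first ten rows of three columns in the str() output, so its values will differ from the full-data correlations reported above. The names reds.sample, quality.cor and ranked are hypothetical.

```r
# Toy data: first ten rows of three features plus quality, copied from
# the str() output. Ten rows are too few for reliable correlations.
reds.sample <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

features <- setdiff(names(reds.sample), "quality")

# Pearson correlation of each feature with quality.
quality.cor <- sapply(reds.sample[features], cor, y = reds.sample$quality)

# Rank by absolute value so strong negative correlations rank high too.
ranked <- quality.cor[order(abs(quality.cor), decreasing = TRUE)]
ranked
```

Ranking by absolute value matters here because a correlation of -0.39 is a stronger relationship than one of +0.2, even though it is numerically smaller.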

The plot above shows the most strongly related pairs of features:

0.7 fixed.acidity & citric.acid
0.7 fixed.acidity & density
-0.7 fixed.acidity & pH
-0.6 volatile.acidity & citric.acid
0.5 quality & alcohol

Below is a graph showing the relationship between volatile acidity and quality. The orange line is the mean of volatile acidity.

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Some features, such as fixed.acidity and citric.acid, strengthened each other. The dataset generally suggests that the more citric acid there is in the wine, the higher its fixed acidity.

Were there any interesting or surprising interactions between features?

The most interesting interaction is between alcohol and quality. I had assumed that wine quality had no connection with alcohol, but it turned out that most high-quality wines have a higher alcohol content.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a model to predict red wine quality from its two most related features, volatile acidity and alcohol. The model is
quality = 3.0955 - 1.3836 * volatile.acidity + 0.3138 * alcohol
The strength is that it only needs two features to predict the quality. The limitation is that it is obviously

## 
## Call:
## lm(formula = quality ~ volatile.acidity + alcohol, data = reds)
## 
## Coefficients:
##      (Intercept)  volatile.acidity           alcohol  
##           3.0955           -1.3836            0.3138
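As a worked example of the fitted equation, the sketch below plugs the first wine from the str() output (volatile.acidity 0.7, alcohol 9.4, observed quality 5) into the coefficients printed above. The helper predicted.quality() is hypothetical and only restates the equation; with the fitted model object available, predict() would do the same job.

```r
# Coefficients copied from the lm() output above.
intercept          <- 3.0955
b.volatile.acidity <- -1.3836
b.alcohol          <- 0.3138

# Hypothetical helper (not part of the original analysis) that applies
# the fitted linear equation to given feature values.
predicted.quality <- function(volatile.acidity, alcohol) {
  intercept + b.volatile.acidity * volatile.acidity + b.alcohol * alcohol
}

# First wine in the dataset: volatile.acidity = 0.7, alcohol = 9.4.
first.wine <- predicted.quality(0.7, 9.4)
first.wine
```

The prediction is about 5.08, close to the observed score of 5. The output is continuous while quality is recorded as an integer, so predictions need rounding before they can be compared with actual scores.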

I also tried to predict the quality with all the features.

m1 <- lm(quality ~ fixed.acidity, data = reds)
m2 <- update(m1, ~.+volatile.acidity)
m3 <- update(m2, ~.+citric.acid)
m4 <- update(m3, ~.+residual.sugar)
m5 <- update(m4, ~.+chlorides)
m6 <- update(m5, ~.+free.sulfur.dioxide)
m7 <- update(m6, ~.+total.sulfur.dioxide)
m8 <- update(m7, ~.+density)
m9 <- update(m8, ~.+pH)
m10 <- update(m9, ~.+sulphates)
m11 <- update(m10, ~.+alcohol)
m12 <- update(m11, ~.+acidity.ratio)
m13 <- update(m12, ~.+sulfur.dioxide.per)

library(memisc)  # provides mtable()
mtable(m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12,m13)
## 
## Calls:
## m1: lm(formula = quality ~ fixed.acidity, data = reds)
## m2: lm(formula = quality ~ fixed.acidity + volatile.acidity, data = reds)
## m3: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid, 
##     data = reds)
## m4: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar, data = reds)
## m5: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides, data = reds)
## m6: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide, data = reds)
## m7: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide, 
##     data = reds)
## m8: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density, data = reds)
## m9: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH, data = reds)
## m10: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates, data = reds)
## m11: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = reds)
## m12: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol + acidity.ratio, data = reds)
## m13: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol + acidity.ratio + sulfur.dioxide.per, 
##     data = reds)
## 
## ================================================================================================================================================================================
##                            m1         m2         m3         m4         m5         m6         m7          m8           m9           m10         m11         m12         m13      
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)            5.157***   6.451***   6.450***   6.439***   6.541***   6.641***   6.663***   166.584***   182.592***   189.679***   21.965      20.931      15.031     
##                         (0.098)    (0.121)    (0.121)    (0.123)    (0.124)    (0.131)    (0.129)     (14.259)     (14.795)     (14.266)    (21.195)    (22.007)    (22.332)    
##   fixed.acidity          0.058***   0.012      0.014      0.014      0.006      0.001     -0.016        0.119***     0.174***     0.172***    0.025       0.022       0.020     
##                         (0.012)    (0.011)    (0.015)    (0.015)    (0.015)    (0.015)    (0.015)      (0.019)      (0.023)      (0.023)     (0.026)     (0.031)     (0.031)    
##   volatile.acidity                 -1.732***  -1.746***  -1.752***  -1.608***  -1.623***  -1.404***    -1.219***    -1.256***    -0.984***   -1.084***   -1.059***   -1.067***  
##                                    (0.107)    (0.127)    (0.128)    (0.130)    (0.130)    (0.131)      (0.127)      (0.127)      (0.125)     (0.121)     (0.184)     (0.184)    
##   citric.acid                                 -0.032     -0.042      0.174      0.176      0.464**      0.202        0.170        0.047      -0.183      -0.184      -0.177     
##                                               (0.152)    (0.153)    (0.159)    (0.158)    (0.160)      (0.156)      (0.156)      (0.150)     (0.147)     (0.147)     (0.147)    
##   residual.sugar                                          0.007      0.008      0.014      0.022        0.080***     0.085***     0.095***    0.016       0.016       0.015     
##                                                          (0.013)    (0.013)    (0.014)    (0.013)      (0.014)      (0.014)      (0.013)     (0.015)     (0.015)     (0.015)    
##   chlorides                                                         -2.019***  -2.005***  -2.071***    -1.195**     -0.645       -2.278***   -1.874***   -1.873***   -1.834***  
##                                                                     (0.413)    (0.412)    (0.405)      (0.398)      (0.421)      (0.431)     (0.419)     (0.419)     (0.420)    
##   free.sulfur.dioxide                                                          -0.004*     0.008**      0.006**      0.005*       0.004       0.004*      0.004*     -0.001     
##                                                                                (0.002)    (0.002)      (0.002)      (0.002)      (0.002)     (0.002)     (0.002)     (0.004)    
##   total.sulfur.dioxide                                                                    -0.006***    -0.005***    -0.004***    -0.004***   -0.003***   -0.003***   -0.002     
##                                                                                           (0.001)      (0.001)      (0.001)      (0.001)     (0.001)     (0.001)     (0.001)    
##   density                                                                                            -161.857***  -180.667***  -188.401***  -17.881     -16.824     -11.008     
##                                                                                                       (14.431)     (15.178)     (14.638)    (21.633)    (22.465)    (22.774)    
##   pH                                                                                                                 0.675***     0.625***   -0.414*     -0.422*     -0.421*    
##                                                                                                                     (0.175)      (0.169)     (0.192)     (0.198)     (0.198)    
##   sulphates                                                                                                                       1.261***    0.916***    0.914***    0.905***  
##                                                                                                                                  (0.113)     (0.114)     (0.115)     (0.115)    
##   alcohol                                                                                                                                     0.276***    0.277***    0.277***  
##                                                                                                                                              (0.026)     (0.027)     (0.027)    
##   acidity.ratio                                                                                                                                           0.001       0.001     
##                                                                                                                                                          (0.004)     (0.004)    
##   sulfur.dioxide.per                                                                                                                                                  0.332     
##                                                                                                                                                                      (0.217)    
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                  0.0        0.2        0.2        0.2        0.2        0.2        0.2         0.3          0.3          0.3         0.4         0.4         0.4    
##   adj. R-squared             0.0        0.2        0.2        0.2        0.2        0.2        0.2         0.3          0.3          0.3         0.4         0.4         0.4    
##   sigma                      0.8        0.7        0.7        0.7        0.7        0.7        0.7         0.7          0.7          0.7         0.6         0.6         0.6    
##   F                         25.0      144.3       96.2       72.2       63.4       53.9       56.0        68.5         63.1         73.6        81.3        74.5        69.0    
##   p                          0.0        0.0        0.0        0.0        0.0        0.0        0.0         0.0          0.0          0.0         0.0         0.0         0.0    
##   Log-likelihood         -1914.2    -1793.7    -1793.7    -1793.6    -1781.6    -1778.9    -1750.6     -1689.8      -1682.3      -1622.1     -1569.1     -1569.1     -1567.9    
##   Deviance                1026.1      882.6      882.5      882.4      869.3      866.3      836.2       774.9        767.8        712.1       666.4       666.4       665.4    
##   AIC                     3834.5     3595.5     3597.4     3599.1     3577.3     3573.8     3519.3      3399.5       3386.7       3268.3      3164.3      3166.2      3165.9    
##   BIC                     3850.6     3617.0     3624.3     3631.4     3614.9     3616.8     3567.7      3453.3       3445.8       3332.8      3234.2      3241.5      3246.5    
##   N                       1599       1599       1599       1599       1599       1599       1599        1599         1599         1599        1599        1599        1599      
## ================================================================================================================================================================================
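One way to read the table is to compare the models by AIC and BIC, where lower values indicate a better trade-off between fit and complexity. The sketch below copies the AIC and BIC rows from the mtable() output; the variable names are hypothetical.

```r
# AIC and BIC values copied from the mtable() output above (m1 to m13).
model.names <- paste0("m", 1:13)
aic <- c(3834.5, 3595.5, 3597.4, 3599.1, 3577.3, 3573.8, 3519.3,
         3399.5, 3386.7, 3268.3, 3164.3, 3166.2, 3165.9)
bic <- c(3850.6, 3617.0, 3624.3, 3631.4, 3614.9, 3616.8, 3567.7,
         3453.3, 3445.8, 3332.8, 3234.2, 3241.5, 3246.5)
names(aic) <- model.names
names(bic) <- model.names

# Model with the lowest value of each criterion.
best.aic <- names(which.min(aic))
best.bic <- names(which.min(bic))

c(AIC = best.aic, BIC = best.bic)
```

Both criteria are lowest for m11, the model with the eleven original features, which suggests that adding acidity.ratio and sulfur.dioxide.per in m12 and m13 does not improve the model.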

Final Plots and Summary

Plot One

Description One

This plot shows the distribution of fixed acidity. It is close to a bell-shaped normal distribution. Most of the red wines have fixed acidity below 10.

Plot Two

Description Two

This plot shows that red wine of higher quality tends to have lower volatile acidity. It is evidence of a relationship between volatile acidity and red wine quality.

Plot Three

Description Three

Reflection

If I had time to do this project again, I would choose a larger dataset with a more complex data structure. The features of this dataset are almost all numerical, and only one variable (quality) can be treated as an ordered factor. This limited the possibilities I could explore. Another thing I should have done better was creating new variables. The two variables I created turned out to be no more useful than the others. Finally, although the relationship between alcohol and quality surprised me, I have to say I did not find anything as interesting as I expected.

Reference

https://s3.amazonaws.com/content.udacity-data.com/courses/ud651/diamondsExample_2016-05.html
https://s3.amazonaws.com/udacity-hosted-downloads/ud651/GeographyOfAmericanMusic.html